The white wine dataset includes 11 different chemical attributes and 1 subjective quality score based on wine experts’ opinions, for a total of 12 data points for each of nearly 5,000 white wines.
The different chemical attributes are:
1. Fixed Acidity
2. Volatile Acidity
3. Citric Acid
4. Residual Sugar
5. Chlorides
6. Free Sulfur Dioxide
7. Total Sulfur Dioxide
8. Density
9. pH
10. Sulphates
11. Alcohol
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
Looking over the summaries, it would seem I’m running into a ton of far outliers. For many of the variables, the 1st and 3rd quartiles are much further from the min and max, respectively, than they are from one another (telling me most of the data sits in the center, but some points absolutely do not).
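To make the quartile comparison concrete, here is a small sketch that types a few rows of the summary above into R by hand. The matrix name `five_num` and its column labels are my own, and the numbers are copied from the printed summary rather than recomputed from the data.

```r
# Min, 1st quartile, 3rd quartile and max, copied from the summary() output.
five_num <- matrix(
  c(3.8,   6.3,   7.3,   14.2,
    0.009, 0.036, 0.050, 0.346,
    0.6,   1.7,   9.9,   65.8,
    2,     23,    46,    289),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("fixed.acidity", "chlorides", "residual.sugar", "free.sulfur.dioxide"),
    c("min", "q1", "q3", "max")
  )
)

iqr       <- five_num[, "q3"]  - five_num[, "q1"]   # width of the central 50%
lower_gap <- five_num[, "q1"]  - five_num[, "min"]  # min to 1st quartile
upper_gap <- five_num[, "max"] - five_num[, "q3"]   # 3rd quartile to max

round(cbind(iqr, lower_gap, upper_gap, upper_ratio = upper_gap / iqr), 3)
```

For all four of these variables, the gap from the 3rd quartile to the max is several times the interquartile range, which is what "far outliers" looks like numerically.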
As far as charts go, the first thing I was really interested in seeing was a set of histograms so I can see each variable. This gives me a good idea of what I want to look into next. A few, such as citric acid and chlorides, have very long tails. Some others, like residual sugar and density, don’t have any visible high values but the charts are still formatted as though they do, so I’d like to see what’s going on there.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.100 6.300 6.800 6.839 7.300 9.100
True to form! Fixed Acidity has extreme outliers on both ends. I tried the same plot while excluding ONLY the highest value data point, which helped, but I wanted to see how it would turn out after lopping percentages. Running the chart again while including only the central 98% looks like it makes a lot more sense.
The final summary is of that chunk of 98% of the data; the min and max have obviously changed drastically, but three of the four central stats are identical to those from the full set, and the fourth is extremely close (off by 0.016). I get concerned about how comfortable I feel excluding outliers, so seeing those summaries acts as validation in a sense.
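A minimal sketch of the "central 98%" trim, assuming it was done with quantile cutoffs (the original code is not shown, so this is a guess at the approach). The helper name `trim_central` is hypothetical, and the vector is simulated toy data, not the real Fixed Acidity column.

```r
# Toy stand-in for fixed.acidity (simulated, NOT the real data),
# with one extreme value added at each end.
set.seed(1)
fixed.acidity <- c(rnorm(998, mean = 6.85, sd = 0.8), 14.2, 3.8)

# Keep only the central `keep` share of the values,
# dropping an equal share from each tail.
trim_central <- function(x, keep = 0.98) {
  tail_prob <- (1 - keep) / 2
  bounds <- quantile(x, probs = c(tail_prob, 1 - tail_prob), na.rm = TRUE)
  x[!is.na(x) & x >= bounds[1] & x <= bounds[2]]
}

trimmed <- trim_central(fixed.acidity, keep = 0.98)

summary(fixed.acidity)  # full vector
summary(trimmed)        # central 98% only
```

Comparing the two summaries side by side is the same sanity check described above: the min and max move a lot, while the quartiles and median should barely change.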
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
A lot of these variables are in the same boat in that they have one or a few extreme outliers. Log10 helps some of them look like a workable data model, but it’s definitely not always a fix.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
There’s a bit of a tail, but honestly it doesn’t seem significant. If I had to guess, that spike right before 0.5 is some sort of secondary standard (like the standard if the citric acid content is higher), so whether that’s from rounding or actual measurements, the amount doesn’t seem concerning.
What I just wrote was a valid explanation in my opinion, but I was still writing it partially to convince myself I don’t need to try log10. I did wind up trying and the table turned out far worse; results are below.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.03889 0.04638 0.04884 0.05571 0.22432
I got it in my head that getting this percentage (as well as one a little ways down the document), that is how much of the Fixed Acidity is Citric Acid, was potentially important and could show me some unexpected insight. At this point, I haven’t learned anything from it except that Citric Acid very rarely makes up more than 10% of a white wine’s Fixed Acidity.
Note from the future: I’d read that citric acid is included in a wine’s fixed acidity, but looking at the dataset information again, I’m pretty sure the fixed acidity is just tartaric acid; it’s not very clear between the variable and the description found below. Anyway if the fixed variable does NOT include citric, my whole idea behind this one is out the window.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
## , , volatile.acidity = 0.965, citric.acid = 0.6, residual.sugar = 65.8, chlorides = 0.074, free.sulfur.dioxide = 8, total.sulfur.dioxide = 160, density = 1.03898, pH = 3.39, sulphates = 0.69, alcohol = 11.7, quality = 6, pct.citric = 0.0769230769230769
##
## fixed.acidity
## X 7.8
## 2782 1
The highest value in this set is over double the value of the second highest. I somewhat expected it to be the same record as the Sulfur Dioxide anomaly (it’s further down, I did some ill-fated rearranging), but it’s a different one.
Excluding that top 1% gives a way better picture of the data, and it looks ripe for a log transformation. Residual Sugar after a log10 transformation almost looks bimodal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Chlorides need to be cut down pretty heftily to look at the bulk of the data. I had to take 5% off the top (after trying 1, 2, 3, and 4%). The 1st and 3rd quartiles only differ by 0.014, but the maximum value is almost seven times larger than the 3rd quartile, 0.296 more by value.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
## 'data.frame': 1 obs. of 14 variables:
## $ X : int 4746
## $ fixed.acidity : num 6.1
## $ volatile.acidity : num 0.26
## $ citric.acid : num 0.25
## $ residual.sugar : num 2.9
## $ chlorides : num 0.047
## $ free.sulfur.dioxide : num 289
## $ total.sulfur.dioxide: num 440
## $ density : num 0.993
## $ pH : num 3.44
## $ sulphates : num 0.64
## $ alcohol : num 10.5
## $ quality : int 3
## $ pct.citric : num 0.041
My dataset information has the following to say about free sulfur dioxide:
“at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine”
If one of these actually does measure at 289, that’s well beyond the realm of reason (also illegal in many places if it’s actually white wine and not sweet wine). I’m ignoring this one for this variable, but I’m curious to see how it fares in the other categories. The second highest value is 146.5, still well beyond what’s expected; in total, 17 values are above 100. For now, I’ll just remove the single highest, but I’ll keep those others in mind as I continue.
Again, I’m not positive what I’ll wind up doing with these, but I’ll likely want to check the oddly high values against quality later.
Predictably, Total Sulfur Dioxide follows the same pattern.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02362 0.19093 0.25368 0.25558 0.31579 0.71053
I wanted to see the ratio of Free Sulfur Dioxide to Total Sulfur Dioxide; obviously there’s no guaranteed correlation between this new variable and any others, but it seems there possibly could be. Laid out in a histogram, this new percentage variable follows the same pattern as most of the others, but the upper outliers aren’t quite as extreme as many. I did try a log10 transformation, but that just turned the data negatively skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
The summary makes it clear that Density falls squarely in the “extremely high outliers” camp. I didn’t have much luck with log10, but shaving off just 0.3% from each end made a huge difference.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
Really the only thing I’d potentially do to change pH is cut off those outliers on both ends, but I’m not even convinced it’s necessary at this point. Additionally, if I’m comparing pH with other variables, I’ll likely be paring the set a bit already.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Without removing any data points from Sulphates, it can be taken from a normal distribution with a long tail to a pretty normal distribution using log10.
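As a rough illustration of why log10 can tame a long right tail, the sketch below compares a simple moment-based skewness before and after the transformation. The data is simulated from a log-normal distribution as a stand-in (an assumption, not the real Sulphates column), and the `skewness` helper is my own.

```r
# Simulated right-skewed data as a stand-in for sulphates (NOT the real column).
set.seed(2)
sulphates <- rlnorm(2000, meanlog = log(0.47), sdlog = 0.22)

# Moment-based sample skewness: 0 for a symmetric distribution,
# positive when the right tail is longer.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_raw <- skewness(sulphates)
skew_log <- skewness(log10(sulphates))

c(raw = skew_raw, log10 = skew_log)
```

One thing to watch: `log10(0)` is `-Inf` in R, and the summary shows Citric Acid has a minimum of 0, so that column would need an offset or filtering before a log transformation.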
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
##
## 8 8.4 8.5 8.6 8.7 8.8 8.9 9 9.1 9.2 9.3 9.4 9.5 9.6 9.7
## 2 3 9 23 78 107 95 185 144 199 134 229 231 131 107
## 9.8 9.9 10 10.1 10.2 10.3 10.4 10.5 10.6 10.7 10.8 10.9 11 11.1 11.2
## 137 109 163 116 133 85 153 163 117 97 135 90 162 86 112
## 11.3 11.4 11.5 11.6 11.7 11.8 11.9 12 12.1 12.2 12.3 12.4 12.5 12.6 12.7
## 106 127 89 49 60 63 56 102 53 89 63 68 83 63 56
## 12.8 12.9 13 13.1 13.2 13.3 13.4 13.5 13.6 13.7 13.8 13.9 14 14.1 14.2
## 57 41 36 20 14 7 20 12 10 7 2 3 5 1 1
Based on what I’m seeing here, alcohol content seems to have a general downward slope. There are certainly higher points along the way, but many of those would be at numbers ending in .0 or .5 (for example: 11.9 has a count of 56; 12 has a count of 102; 12.1 has a count of 53), so that looks more rounding related. 9.5 is the value for the 1st quartile and also the mode. 9.4 would be the mode if not rounding to the tenth, only by a count of 1.
My delve into Alcohol will almost definitely benefit from binning later on, possibly by ones, but most likely by halves (8:8.4, 8.5:8.9, etc).
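A minimal sketch of binning by halves with `cut()`, using only the ten Alcohol values shown in the `str()` output near the top. The names `half_bins` and `counts` are hypothetical; `right = FALSE` makes each bin include its lower edge, so 8.5 lands in the 8.5 to 8.9 bin as described.

```r
# First ten alcohol values, copied from the str() output above.
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

# Half-point bins: [8, 8.5), [8.5, 9), ..., [14, 14.5).
half_bins <- cut(alcohol, breaks = seq(8, 14.5, by = 0.5), right = FALSE)

counts <- table(half_bins)
counts[counts > 0]  # show only the bins that actually contain wines
```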
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
Clearly 6/10 is the most common quality rating, followed by 5, then 7. The reason I took an actual count was to see if 4 or 8 was higher, and it’s 8. No white wines received a score below 3 or above 9 (and only 5 got a score of 9).
The white wine dataset holds 12 data points (technically 13, but one is just a record-identifying integer) for each of 4,898 records. 11 of these variables are numbers, the identifier is an integer, and the quality rating is an integer. That being the case, the values are pretty straightforward without any specialized ranking/ordering.
Some off the bat stuff I’ve seen:
Alcohol content looks so organized with what looks like natural bins, but it doesn’t resemble anything else nor am I very confident there IS a correlation with any other factors.
Quality is usually ‘average’ at a rating of 6/10. The ratings go neither below 3 nor above 9, and both of those are rare. This may make it more difficult to find any possible correlation with other factors, as there are so few data points on the high or low ends.
Wines with Residual Sugar over 45 are considered ‘sweet’, which I believe drops them from the ‘white’ category; I only have one above 45 at 65.8. This record (2782) is also extremely high in Volatile Acidity and Density, and on the high end of several other categories. This could point to a correlation between Residual Sugar and Volatile Acidity and/or Residual Sugar and Density, or it could mean nothing, but luckily that will be easy enough to find out.
Volatile Acidity, Citric Acid, and Free Sulfur Dioxide are the three variables the dataset information specifically mentions as affecting taste (Citric Acid in a positive manner and the other two negatively). It stands to reason these three variables should most heavily affect Quality, which is publicly the most important factor here.
I know from personal experience that Alcohol can affect the taste of a wine, as can Residual Sugar. Apart from their potential relation to Quality, I’d also like to check those two against one another, as sugar is ‘eaten’ during the fermentation process, so high Residual Sugar should indicate high Alcohol and/or that fermentation was halted before it could finish.
I added both ‘pct.citric’ and ‘pct.free’.
pct.citric refers to Citric Acid as a percentage of the total Fixed Acidity. Citric Acid is already calculated into that total, so the calculation is simple division. The main reason for creating this variable is that citric acid is said to, in small quantities, ‘add “freshness” and flavor to wines’ (according to the dataset information). I want to see if this is actually true, or at the very least if there’s any correlation at all between the Citric Acid, the Percent Citric Acid, and the Quality.
pct.free refers to Free Sulfur Dioxide as a percentage of Total Sulfur Dioxide, again just a simple division operation. Too much Free Sulfur Dioxide in particular (as opposed to the Total Sulfur Dioxide) is said to be the potential problem when it comes to taste, but as with the Citric Acid, I’d like to see if the ratio actually seems to come into play.
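The two ratio variables can be sketched as below, using the first four rows from the `str()` output near the top. The data frame name `wines` is an assumption. The third column, `pct.citric.alt`, is a hypothetical name for the variant that would apply if Fixed Acidity does not already include Citric Acid (the doubt raised in the earlier "note from the future").

```r
# First four rows, copied from the str() output above.
wines <- data.frame(
  fixed.acidity        = c(7.0, 6.3, 8.1, 7.2),
  citric.acid          = c(0.36, 0.34, 0.40, 0.32),
  free.sulfur.dioxide  = c(45, 14, 30, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186)
)

# Citric acid as a share of fixed acidity (assumes fixed acidity includes it).
wines$pct.citric <- wines$citric.acid / wines$fixed.acidity

# Free sulfur dioxide as a share of total sulfur dioxide.
wines$pct.free <- wines$free.sulfur.dioxide / wines$total.sulfur.dioxide

# Variant if fixed acidity does NOT include citric acid:
# the denominator becomes the sum of the two.
wines$pct.citric.alt <- wines$citric.acid /
  (wines$fixed.acidity + wines$citric.acid)

wines
```

The two versions of the citric ratio differ only slightly when Citric Acid is a small share of the total, so they should be expected to behave almost identically in correlations.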
Many of my variables hold extremely high outliers. Fixed Acidity, Volatile Acidity, Citric Acid, Residual Sugar, Chlorides, Free Sulfur Dioxide, Total Sulfur Dioxide, and Density all have seemingly unreasonably high outliers. Each of those can be ‘fixed’ by shaving off a very small percentage from the highest values.
A couple others (pH and Sulphates, and Volatile Acidity kind of falls in the previous camp as well as this one) have very long tails; while it’s the same problem to an extent (high outliers), this means there are MORE of those outliers, so if I want to see anything useful in the data, I need to remove a higher percentage, which isn’t exactly ideal.
Alcohol is very interesting. I’d mentioned I think it’s likely more of a downward slope based on rounding, but I could be wrong. The data generally looks like there are three or four tiers of alcohol content, each its own little hill, so I’ll likely be separating those values into three or four groups based on those values during further analysis.
I was really excited to run this and see what I can see! I tried ggpairs, but the outliers made a really negative impact; while deciding the best way to go about removing them from the matrix, I found out about ‘psych’ from the sample project and very happily switched over.
Fixed Acidity and Volatile Acidity (Citric Acid variables are discussed below) simply don’t seem to correlate with anything. I did create a total.acidity variable, summing up Fixed and Volatile, but it still showed me nothing; Volatile Acidity is present in too small of volumes to really impact the total. I also tried using the ratio of Volatile to Fixed to no avail. Acids just seem to exist in a vacuum when it comes to white wine. Percent Free Sulfur Dioxide looks like a useless metric as well; I was excited about ratios, thinking it made sense that they’d affect other aspects, but they offered nothing of value.
I previously did get stuck in the ‘correlation coefficient’ trap, in that I was looking at those discouraging scores and discounting a lot of possibilities due to it. This chart’s highest scores are between Density and both Sugar/Alcohol, but those are used in measuring one another, so if anything those scores should actually be higher.
## [1] 0.9371852
I was a bit attached to the Citric Acid metrics, the original citric.acid as well as my created pct.citric; I’m a big fan of citrus overall, and it’s supposed to help with flavor and add ‘freshness’. I wound up removing them from my pairs matrix as the Citric Acid to Quality correlation was only -0.019, and the only other correlation score worth a second glance was somewhat internal, between Citric Acid and Percent Citric Acid at 0.9372.
That did surprise me actually, as it indicates a standalone quality. What I mean is it seems Citric Acid will or won’t be added, regardless of any other component. As more Citric Acid is added, it makes up a higher percentage of Fixed Acidity, showing it’s not added or measured in any sort of proportion to another component. It’s not a drastic percentage change, but it’s extremely clear.
Note! Now that I’m under the impression Citric Acid is NOT counted into that Fixed Acidity number, pct.citric would have to be changed from citric.acid / fixed.acidity to citric.acid / (fixed.acidity + citric.acid). I made the new variable and ran the correlation for it and it’s only about 0.0007 lower at 0.9365.
As Quality is what I’d like to be able to predict here, I wanted a quick glance at each variable interacting with it. After looking at this initially, I added red mean lines on charts that looked like they may have anything near a steadily rising or lowering path between a Quality rating of 5 and 9 (inclusive). 3 of them at least seemed like a long shot.
Narrowing down the compared variables as well as cutting 3 and 4 from the x-axis helped to see these more clearly, but if anything I’m just less confident in all of them now, mostly so the ones I found to be unlikely in the first place (Citric Acid, Residual Sugar, Total Sulfur Dioxide, & pH; ie most of them).
Not displayed are the results of 6 different code cells all computing R^2 values, with various modifiers and methods. First are the 6 variables that potentially looked promising, plus Alcohol, with each variable’s worst outliers excluded and only Quality scores of 5 or higher included. I left Alcohol out of the 6-plot grid as I already knew Quality’s highest correlation was with Alcohol.
R^2 with Quality above 4
Citric Acid -> 0.000631755
Residual Sugar -> 0.01804341
Chlorides -> 0.08302848
Total Sulfur Dioxide -> 0.04429476
Density -> 0.1188008
pH -> 0.01226858
Alcohol -> 0.2186184
The remaining cells contain formulas for the summaries and R^2 values using all Quality scores, using a subset with collective limits in place as opposed to limits per variable, using the entire set without limits barring Quality, using the entire set without limits including Quality, transforming variables with log10, adding the rejected variables back into the mix, and different combinations of those. Essentially I was desperately trying to find more useful R^2 values with Quality. I didn’t find anything useful in all those hidden results, but at least I won’t be left wondering.
## [1] 0.4675664
With an unimpressive score of 0.43, Quality’s greatest correlation coefficient is with Alcohol. Doing the initial scatterplot wasn’t very well thought-out, but after switching to the box plot, this looked very promising. Starting at the Quality score of 5 while focusing on the medians, there seems to be a fantastic positive correlation, but the spread on the boxes is pretty extreme. The correlation between Alcohol and Quality, barring Quality values below 5, is 0.47 so there’s only a change of +0.04.
While statistically this isn’t particularly impressive, the boxplot paints a pretty drastic picture.
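For a linear fit with a single predictor, R^2 is just the square of the Pearson correlation, which is why 0.4676 squared gives roughly the 0.2186 listed for Alcohol earlier. The sketch below checks that identity on simulated toy data (the relationship between the two columns is made up purely for illustration).

```r
# Toy data: simulated alcohol and quality values, NOT the real wines.
set.seed(3)
alcohol <- runif(500, min = 8, max = 14)
quality <- round(3 + 0.3 * alcohol + rnorm(500, sd = 0.8))

r   <- cor(alcohol, quality)       # Pearson correlation
fit <- lm(quality ~ alcohol)       # one-predictor linear model
r2  <- summary(fit)$r.squared      # R^2 from the model

c(r = r, r_squared = r^2, lm_r_squared = r2)
```

This is also a reminder that a correlation that looks moderate (around 0.45) still only explains about a fifth of the variance.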
## [1] 0.03488758
## [1] -0.04945677
Theoretically, Free Sulfur Dioxide and Quality should have a fair correlation, as FSD specifically causes a bad taste in wine. The first plot seems counter-intuitive as the FSD measurement actually increases with quality overall, but a closer look at the dataset information tells me 50 mg/dm^3 is when FSD starts having a negative impact. The horizontal line at 50 demonstrates that the bulk of the data, including the entirety of all boxes on the plot, falls below the ‘bad taste’ threshold.
Taking that into account and excluding values below 50, the plot looks like it makes more sense, with the medians and means steadily decreasing (except the tiny jump from the median of 6 to that of 7) but the correlation is still basically non-existent at -0.049; at least it’s negative now though.
As it stands, the Quality of any given white wine really doesn’t seem to be predictable based on the other values. As Alcohol is the only variable with even a moderate correlation to Quality, I wonder if it can be predicted based on the other values. Density does have the second highest correlation (this one is negative) to Quality, and also is closely related to Sugar and Alcohol, so I’ll be looking at that one as well.
##
## Pearson's product-moment correlation
##
## data: residual.sugar and alcohol
## t = -36.222, df = 4846, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4834565 -0.4391437
## sample estimates:
## cor
## -0.461588
## # A tibble: 4 x 3
## alcohol.bucket3 mean.sugar median.sugar
## <fct> <dbl> <dbl>
## 1 (7.9,9.5] 9.98 10.6
## 2 (9.5,10.4] 6.19 6
## 3 (10.4,11.4] 4.42 2.5
## 4 (11.4,14.3] 4.12 2.9
## [1] 0.2130635
Note to grader! One of the things I was supposed to fix is the box plots showing Alcohol against Residual Sugar, as both are continuous values. However the grading notes did say what I need to do is transform the x-axis variable into categorical with bins/buckets, but that’s exactly what this grid is: the box plots with x-axes of Alcohol as cut into bins using 3 different sets of separation points. I added the buckets as the color values in case that’s what the grader meant, but I’m not sure overall.
I’ve got multiple bucket configurations here because I really didn’t know what the best route would even be regarding splitting up Alcohol. The 3rd one, the one using numbers from a statistical summary, seems like it’s the most valid as it’s not based on my likely-misguided estimates of ‘hill’ size or just evenly spaced (which takes parameters into account but totally ignores any aspects of the actual data contained within). I made the buckets in the first place in order to try finding some hidden relationships; the Alcohol histogram somewhat looks like it has four ‘hills’ to it so cutting it up seemed to have at least potential to help.
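A minimal sketch of the third bucket configuration, using the break points visible in the printed table above (7.9, 9.5, 10.4, 11.4, 14.3). The column name `alcohol.bucket3` follows that table. The data frame `wines` and its values are simulated toy data, so the resulting means and medians will not match the real ones.

```r
# Toy data: simulated alcohol and residual sugar, NOT the real wines.
set.seed(4)
n <- 400
alcohol <- round(runif(n, min = 8, max = 14.2), 1)
residual.sugar <- pmax(0.6, 16 - alcohol + rnorm(n, sd = 2))
wines <- data.frame(alcohol, residual.sugar)

# Cut alcohol at the summary-based break points from the printed table.
wines$alcohol.bucket3 <- cut(wines$alcohol,
                             breaks = c(7.9, 9.5, 10.4, 11.4, 14.3))

# Mean and median residual sugar per bucket.
by_bucket <- split(wines$residual.sugar, wines$alcohol.bucket3)
sugar_summary <- data.frame(
  mean.sugar   = sapply(by_bucket, mean),
  median.sugar = sapply(by_bucket, median)
)
sugar_summary
```

Using quartile values as break points gives buckets with roughly equal counts in the real data, which is one reason this version is easier to defend than hand-picked or evenly spaced cuts.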
These are just four more plots based on Alcohol and Residual Sugar. I was transforming data, swapping axes, adding different lines, etc. trying to find a new way to see the relationship and I just didn’t learn anything from it.
## [1] 0.4918294
In all honesty, I included both the charts with and without outliers for two reasons. First reason is because this is a particularly dramatic demonstration of the difference outliers can make visually; second reason is because the Including Outliers plot made me laugh heartily and unexpectedly.
## [1] 0.4918294
## [1] 0.2509126
The correlation and R^2 values between Density and Chlorides are actually pretty promising, especially as Alcohol and Sugar are the components intertwined with Density measurement (meaning this could bring me up to 3 helper variables for Density).
## [1] 0.6689685
## [1] 0.6475119
## [1] 0.8132601
## [1] 0.9097275
The first two values displayed are R^2 of Density to Residual Sugar and R^2 of Density to Alcohol. I wanted to try comparing a new variable to Density: residual.sugar - alcohol. The 3rd R^2 value displayed is the result of this new variable and Density at a pretty impressive 0.8132 (with a correlation score of 0.91).
Two of many plots made while attempting to transform Sugar and Alcohol into one variable helpful with Density.
Plotting the new residual.sugar - alcohol variable against Density. There are a whole mess of points down around -10 there because of all the low Residual Sugar values, but overall this is probably the most valid (in terms of helping me proceed) plot I’ve made so far.
Focusing on Quality is fairly frustrating as my highest R^2 value with it is Alcohol at 0.1897; correlation score is 0.43. The second highest is Density at R^2 of 0.1188. The only way I can think to keep pursuing Quality is to focus on Alcohol and Density; if I can learn more about those two, maybe that can lead to a higher understanding of Quality.
I tried using log10s and square roots, limiting data at varying quantiles, entirely lopping two ranks out of only seven total from Quality, going through the processes with variables that seemed irrelevant (those stayed irrelevant), just all sorts of things to no avail.
Density has a hefty relationship with both Alcohol and Residual Sugar, as it should; Density is measured using both those values, so a high correlation should be expected. Density also has a fair relationship to Chlorides with a cor value of 0.4918.
Alcohol has a fair relationship with several variables, but honestly I wanted to wait until the multivariate section to get into it (I tried more than once to start it bivariate but kept wanting to add more variables). Sugar and Alcohol only have an R^2 score of 0.213, but I do wonder if there’s something more between them through some sort of transformation.
Density ~ Chlorides has a correlation of 0.4918 (R^2 of 0.2509), which definitely surprised me as the information I’ve gotten regarding Density states it as a calculation related to Alcohol and Sugar, specifically.
Sugar, Chlorides, and Sulfur Dioxide all have scores between 0.27 and 0.41 with one another. I honestly don’t know if I find that surprising because while low, maybe there’s something there; or if I find it surprising because I would expect those cor scores to be higher. For example, 0.27 for Residual Sugar and Chlorides: it would make a lot of sense to me if that was much higher, as I wouldn’t be surprised to see the balance between sugar and salt more standardized across a specific category of wine.
I don’t know too much about wine, but I do know that along with looking for unique qualities, some characteristics are expected/demanded by people who know more about wine than I do. Salt and sugar are both very distinct and scrutinized culinary components; pretty much anything can be over/under salted or sweetened. It could be that I’m thinking too much about food and it just doesn’t work the same way with wine, or maybe one or both amounts are just so minuscule that they don’t actually affect the taste much. I’m sure I could get more insight by researching this, but looking through this data is what put the idea in my head in the first place so it seemed right to mention it.
After messing around with Residual Sugar and Alcohol for quite a while, I wanted to try residual.sugar - alcohol as a variable. Unfortunately I don’t have any sort of an ‘I have heavy statistics knowledge’ reason as to why, but since Alcohol’s correlation to Density and Sugar’s correlation to Density even out to nearly zero when added together, it made sense to me to subtract the variable with the negative cor value (Alcohol) from that with the positive. Running this new variable against Density gave me a correlation score of 0.91.
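Here is a sketch of the idea on simulated toy data, where Density is generated from an assumed linear relationship with Sugar and Alcohol (so the numbers say nothing about the real wines). It also shows one alternative that is not part of the original analysis: a two-predictor `lm()`, which lets the data choose the weights instead of fixing them at +1 and -1.

```r
# Toy data only: density is simulated from an assumed linear relationship.
set.seed(5)
n <- 1000
residual.sugar <- runif(n, min = 0.6, max = 20)
alcohol        <- runif(n, min = 8, max = 14)
density <- 0.994 + 0.0004 * residual.sugar - 0.0012 * alcohol +
  rnorm(n, sd = 0.0005)

# The hand-built combination from the text: weights fixed at +1 and -1.
sugar.minus.alcohol <- residual.sugar - alcohol

r_sugar <- cor(density, residual.sugar)
r_alc   <- cor(density, alcohol)
r_diff  <- cor(density, sugar.minus.alcohol)

# Alternative (not used in the original analysis): let lm() pick the weights.
fit   <- lm(density ~ residual.sugar + alcohol)
r2_lm <- summary(fit)$r.squared

c(r_sugar = r_sugar, r_alcohol = r_alc, r_diff = r_diff, r2_lm = r2_lm)
```

Because the difference is one particular linear combination of the two predictors, the regression's R^2 can never be lower than the squared correlation of the difference; how much higher it is depends on how far the best weights are from +1 and -1.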
I was so excited about this that I erased the relevant code and wrote it again without notes in order to see if I would get the same results. After that, I searched around and found a different method for calculating the same stats so I would have even more confirmation.
The beginning of the multivariates section contains the code that creates my ‘mm’ variables, which are explained after the next grid. I also set the theme darker so the lighter colors can still be seen properly.
The first plot fits the data okay since Residual Sugar and Alcohol’s values overlap, but as Alcohol has a smaller value range, it’s smooshed in the center. I tried multiple methods to make it fit more nicely (different trans commands, scaling attempts, cutting more data, etc.) before switching gears and finding the min-max scale formula. I made a new variable each for Alcohol and Residual Sugar by passing their corresponding values from my outlier-free data frame through: (x-min(x)) / (max(x)-min(x)) into new columns.
This allows values to be spread to scale from their original into a range of 0 - 1, creating a y-axis along which different values can co-exist in a manner that’s far easier to interpret. The second plot was essentially what I’d been seeing in my head, my goal, so that’s very exciting. The second one also very clearly demonstrates the relationships between Density and each other variable. As Density increases, Residual Sugar increases and Alcohol decreases.
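The min-max formula above can be wrapped in a small helper, sketched here on the ten Alcohol and Residual Sugar values from the `str()` output near the top. The function name `min_max` and the column names `mm.alcohol` and `mm.sugar` are hypothetical; the actual ‘mm’ variable names are not shown in this write-up.

```r
# Rescale a numeric vector to the range 0 to 1:
# (x - min(x)) / (max(x) - min(x))
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# First ten values of each column, copied from the str() output above.
alcohol        <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

mm.alcohol <- min_max(alcohol)
mm.sugar   <- min_max(residual.sugar)

range(mm.alcohol)
range(mm.sugar)
```

Because the minimum and maximum define the scale, a single extreme value squeezes everything else toward 0, which is why applying this to the outlier-free data matters.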
WOW. I added the min-max scaled Chlorides variable and it’s just everywhere. I can see that it is positive (before I added the line), as the upper left and bottom right corners are mostly empty, but especially as compared to the pre-existing plot data, Chlorides doesn’t look at all impressive plotted against Density.
I created a couple variables dividing Alcohol into buckets of 1% each (the final bucket was almost empty so I combined 2%).
While there aren’t many datapoints at Quality 9, it’s still comforting to see its large spike at a high Alcohol level; kind of the same principle, but 4 has one big spike, down at the lower end of Alcohol content. 8 also has a spike up at the high end there, reinforcing the fair correlation between Quality rating and Alcohol content.
Ah yes. This is far easier for me to understand than the scatterplots. Density and Sugar, I’ve already worked with specifically, and they both showcase their negative correlation to alcohol well. Although I know what the Alcohol histogram looks like, this form drives it home better for me, as 3 of the 6 bins are heavily on the lowest 1/3 or so of density. Residual Sugar is confusing me. Almost all the bins are vastly in the lowest 1/4, the lowest Alcohol bin is overall higher in sugar content, but the second lowest bin, 8.9 - 9.9, has bumps in quarters 1, 2, and 3. I just want to know why that Alcohol % specifically is spread relatively evenly across Residual Sugar.
Total Sulfur Dioxide is only really interesting to me here because the lowest Alcohol is to the left of the second lowest, the only one out of order from highest to lowest. Chlorides is essentially the same story, in order from strongest to weakest, apart from the lowest two buckets, which are switched.
An interesting common thread between all four of these is that near-exclusively, as Alcohol content decreases, the range along each x-axis increases. I know Alcohol has a lot more values in those lower buckets than the higher ones, but I also wonder if it’s something more than just higher counts.
These plots are fun, and pretty good at conveying information from where I’m sitting, but I’m not particularly picking up anything new here. Quality scores tend to be higher with higher Alcohol content, are slightly but noticeably negatively correlated with Chlorides, and are a bit negatively correlated with Total Sulfur Dioxide.
Seeing Quality in the manner I did really made the rarity of higher scores stand out. I already knew the numbers and had seen it with the histogram, but seeing so few red dots scattered amid all this color just seemed odd (valid, but odd).
I feel like I should have realized before, but seeing the four density plots together very strongly showed a tendency for the data points to spread as the Alcohol content gets lower, indicating more of a uniformity within the chemical makeup of higher-alcohol white wines.
Nope, didn’t do it! I could hardly find anything much related to anything, I was nowhere near being able to attempt a model. I went into the project wanting to do so but it felt like the more I learned, the less able I’d have been to do so.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Each of these histograms shows Density within the white wine dataset. The first one includes all the values, while the second includes the central 99.4% of values. By removing that 0.6%, the data becomes far easier to subsequently work with.
As both Residual Sugar and Alcohol have high correlations with Density, it could be helpful to see them both plotted against it, on scale with one another. Using a min-max calculation achieves this and the plot shows two clear relations to Density.
Alcohol has a higher correlation to Quality than Quality has to any other variable. While multiple scores show a weaker pattern, the two large spikes over high Alcohol levels are both high Quality scores, while the spike over low Alcohol levels is a low Quality score, helping to reinforce the correlation.
Overall I had a lot of fun going through the data, but honestly it did get very frustrating. I have a suspicion that I was trying to do too much right from the start (not too many plots or something like that, but trying to ‘solve’ this in some way), which made it pretty discouraging when I finally accepted I won’t be able to come up with some sort of model or strong input here. I started kind of giving up during the multivariate section because I truly didn’t know where else to go from what I’d done. It’s very possible that I just got myself stuck on some arbitrary track and blinded myself to other potentials.
Realizing that the fairly weak connection between Alcohol and Quality was my best chance at finding actionable insight was daunting, but at the same time I did laugh. Of course the ratio of alcohol would be the predominant factor in how well the wine is ranked.
Logistically, I really enjoyed working with R overall; it’s a pretty intuitive language and usually not too cumbersome. I do get frustrated regarding packages. While they allow you to do so much more, they very frequently deviate from how R is generally laid out and the syntax involved. I also ran into frequent problems installing and updating packages, as well as finding a solution to a question I had then learning the package involved is defunct.
With the data at large, almost every column had 1 unreasonably extreme outlier. Many of those columns, as well as some not included there, had several or more heavy outliers. Taking percentages off the top and/or bottom worked very well but I continually felt like I wasn’t allowed to do that, like I was breaking some cardinal data rule. Before taking the R course, I’d always been put in a box(plot, zing) regarding data; a homework or test question would feasibly have many valid answers, but there’s only one accepted, so it’s hard for me to break out of the restricted mindset and back into how I naturally think. At the end of the day, I feel like many of the measurements in this dataset weren’t even valid, but I’m not confident in claiming that has nothing to do with my own insecurities and inexperience.